Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md to include workflow diagram #1

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

carlhiggs
Copy link

@carlhiggs carlhiggs commented Apr 16, 2024

Proposed inclusion of a mermaid workflow diagram in the readme markdown to illustrate the order in which files should be run. This could be expanded to add data outputs, potentially.

This looks a bit like this (the diagram is the new bit):
image

Proposed inclusion of a mermaid workflow diagram in the readme markdown to illustrate the order in which files should be run.  This could be expanded to add data outputs, potentially.
@carlhiggs carlhiggs requested a review from StevePem April 16, 2024 03:58
@StevePem
Copy link
Contributor

StevePem commented Apr 17, 2024

Hi Carl - looks great, but needs to change in one respectplease. intervention cycling speed.R doesn't depend on baseline, so should be in a row on its own. Can we have two separate rows:
(1) baseline.R -> intervention destinations.R -> analysis.R
(2) intervention cycling speed.R

Not sure whether you think this would look best as 2 rows in a single Workflow box, or 2 separate boxes - I don't mind.

I think it makes sense to continue to say that they both import functions as required. (Currently, intervention cycling speed.R imports the functions but doesn't actually use them - however I'm not planning to change this.)

(I had a go at changing it, but it didn't look good, so I thought I'd leave it to you if that's OK.)

@carlhiggs
Copy link
Author

Hi Carl - looks great, but needs to change in one respectplease. intervention cycling speed.R doesn't depend on baseline, so should be in a row on its own. Can we have two separate rows: (1) baseline.R -> intervention destinations.R -> analysis.R (2) intervention cycling speed.R

Hi Steve, thanks for checking this out and for the clarification. Just wondering, is the output file ../output/edgesMelbourneUpdatedSpeed.gpkg likely to be used in analysis.R in the future, e.g. to evaluate its impact?

If it were, we could sketch this perhaps as,

flowchart TD
    subgraph Workflow
        direction LR 
        baseline.R --> a(intervention destinations.R)
        a --> analysis.R
        intervention.cycling.speed.R --> analysis.R
    end
    Workflow<-- imports as required ---c(functions/*.R)
Loading

Otherwise, we could do something like this,

flowchart TD
    subgraph WorkflowA[Destination allocation Intervention]
        direction LR 
        baseline.R --> a(intervention destinations.R)
        a --> analysis.R
    end
    subgraph WorkflowB[Cycling speed Intervention]
        direction LR 
        intervention.cycling.speed.R
    end
    WorkflowA<-- imports as required ---c(functions/*.R)
    WorkflowB -- imports as required ---c(functions/*.R)
Loading

what do you think?

@carlhiggs
Copy link
Author

carlhiggs commented Apr 17, 2024

I suppose the other question would be, how important/useful do you think it would be to include data in this?

It could over-complicate things, but then again, it could show which files are generated from which steps, and that could be useful

Would love to hear your thoughts on this @StevePem

@carlhiggs
Copy link
Author

carlhiggs commented Apr 17, 2024

Hi @StevePem , if we included only main code and data in the workflow diagram, and had a file like data-resources.yml that detailed input data in a standard way across projects (I'll try to provide a demo this week), we could mock this up in this kind of way:

flowchart TD
    subgraph WorkflowA[Destination allocation Intervention]
        direction LR 
        baseline.R --> a(intervention destinations.R)
        a --> analysis.R
    end
    subgraph WorkflowB[Cycling speed Intervention]
        direction LR 
        intervention.cycling.speed.R
    end
data-resources.yml --- WorkflowA 
data-resources.yml --- WorkflowB
WorkflowA --> o1[(output/intervention locations.sqlite)]
WorkflowA --> o2([output/intervention tables.xlsx])
WorkflowB --> o3([output/edgesMelbourneUpdatedSpeed.gpkg])
Loading

Maybe its not that important / useful to include all the individual data and sub-functions. Those are details that someone could check out either in data-resources.yml (or readme.MD!) and the code itself for functions.

What do you think about something high level like this to illustrate workflow and outputs?

@StevePem
Copy link
Contributor

StevePem commented Apr 17, 2024

Hi @carlhiggs - thanks for these comments. To respond (and with a few other related thoughts):

  1. It's not currently contemplated that the outputs of intervention cycling speed.R will feed into analysis.R. Rather: (1) intervention locations.sqlite (produced by intervention destinations.R) and the output of intervention cycling speed.R' will be taken up in the broader JIBE model, and (2) the outputs of analysis.R` are for the paper I'm writing with Belen and Tayebeh.
  2. So in your first comment, it's the second diagram (after 'Otherwise, we could do something like this') that is correct.
  3. I don't have a fixed view about whether to record details of the data in the diagram. Though I lean towards thinking that if we've already got a readme text description and/or a .yml file that lists out the data, it's superfluous to copy it out again in the diagram. I guess you'll want to adopt a consistent approach across the various repos.
  4. I like the final diagram in your comments, showing a collective resource for data (rather than listing it all out) and the final outputs. I need to update Atefeh's spreadsheet to show which are 'intermediate' outputs and which are 'final'. I think you'd only want to include 'final' outputs in the diagram.
  5. I'm currently thinking that the 'final' outputs from the destination allocation workflow will be intervention locations.sqlite, baseline assessment.xlsx, accessibility tables.xlsx and underutilisation tables.xlsx. So I think those are the ones that should be shown in the diagram. The others will all be intermediate. However, it's possible I could change my mind on some of them. Certainly some of the intermediate outputs are needed for visualisations that we are contemplating for various purposes.
  6. The final output from the cycling speed workflow is currently contemplated as edgesMelbourneUpdated.gpkg, with 2 layers (residential speed reduced to 40 km/h and 30 km/h). I'll be creating it from the final version of the network produced by the UK team when that's done, and the final name and format will probably match whatever name/format the UK team use. I'll make a decision on this when the final network is ready.
  7. To clarify that last point - I'm currently doing the accessibility and underutilisation analysis on the basis of the intermediate 'clipped network' (I would use the final network if it were ready, but it's not; and the final network won't differ from the clipped network in any way that is material to the analysis). Whereas the cycling speed intervention, which is very simple and will take about 5 minutes to run, needs to be based on the final network.

@carlhiggs
Copy link
Author

Sounds good @StevePem, thanks for the advice and update. Perhaps we could hold off on this for now and update once outputs are finalised, and we agree if/how data could be listed in a configuration file. Let's discuss tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants